feat(agents): supervisor + GitHub repo researcher + tracing #14

Merged
TheAlexPG merged 83 commits into main from feat/ai-agents-investigation
May 6, 2026

Conversation

@TheAlexPG
Owner

Summary

Lands the AI agents investigation branch — multi-agent supervisor architecture, GitHub Repo Researcher, Langfuse tracing, and assorted chat / tracing fixes accumulated since main.

What's new

  • Multi-agent supervisor (LangGraph) orchestrating planner / researcher / diagram / critic sub-agents over a shared AgentState. Delegate tool calls terminate the turn, sub-agents keep isolated message lists, and supervisor history is rewritten so artefacts survive between turns.
  • GitHub Repo Researcher — link a Container/System to a GitHub URL and a read-only sub-agent answers code-grounded questions. 9 repo tools (repo_get_metadata, repo_read_readme, repo_list_tree, repo_read_file, repo_search_code, repo_read_issues, repo_read_pulls, repo_read_commits, repo_read_diff), per-turn LRU cache, hard truncation caps, structured GitHub-error envelope. Per-turn manifest aggregates repos by URL and exposes one delegate_to_git_researcher_<slug> tool per repo.
  • Provider-agnostic LLMs via LiteLLM — workspace stores its own provider / model / base URL / encrypted API key.
  • Langfuse tracing — admin-instance opt-in via LANGFUSE_* env, per-workspace consent (off / errors_only / full). Chat session id now propagated to session_id so follow-up turns appear under one trace group.
  • Chat UX: tool icons popover, magic-prompt starters, JWT refresh in stream, restored tool icons in resumed history (synthesised SSE events from persisted messages).
  • Misc fixes: per-tool DB commit (so user PATCH races don't lose agent writes), token aggregation, researcher Findings markdown-fence strip + graceful truncation, supervisor history limit raised 20 → 40 (compaction ladder still gates token budget separately).

Env / docs

  • New .env.example entries: LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST (optional tracing) and the four AGENT_RATE_LIMIT_* knobs as commented defaults.
  • README: replaced legacy "AI insights (Claude)" bullet with a proper AI agents section (supervisor, repo researcher, explainer, LiteLLM, tracing). Updated stack table (LangGraph + LiteLLM) and Environment section to surface AGENTS_SECRET_KEY + Langfuse, with a note that LLM provider keys + GitHub PAT live per-workspace in the DB.

Out of scope (kept for follow-ups)

  • Persisting node graph-transition events so the "Researcher…" / "Planning…" badges show up when reopening an old chat (current resume only restores tool icons).
  • Persisting tool_status so historical tool calls render with their real success/error tint instead of defaulting to ok.
  • Wiring AGENTS_SECRET_KEY / LANGFUSE_* into docker/docker-compose.yml (prod compose uses env_file: [.env] so it's already covered there).

Test plan

  • Backend tests pass: cd backend && uv run pytest -q
  • Frontend tests pass: cd frontend && npx vitest run
  • Smoke: open chat on a diagram with a repo-linked container, ask "describe this repo" → repo-researcher delegation visible, answer streams, reopen chat → tool icons render in history
  • Smoke without AGENTS_SECRET_KEY → backend boots, agents disabled gracefully (no startup crash)
  • CI: build-backend + build-frontend green

TheAlexPG added 30 commits May 3, 2026 13:22
Major checkpoint commit for the AI agents stack. Brings the supervisor →
researcher → planner → diagram → critic → finalize graph from "almost
working" to "reliable on local Qwen via LM Studio + first-class Langfuse
hierarchy".

Backend:
- agents/runtime.py: catch CancelledError, merge final_state across
  on_chain_end, fall back to findings.summary when supervisor empties out.
- agents/nodes/base.py: terminating_tool_names, isolated_state_for_subagent,
  preserved per-step LLMCallMetadata fields, salvaged result.text on
  finalize-tool exits, added context + delegation-brief renderers.
- agents/llm.py: parent_observation_id, request_timeout to 90s, custom
  provider routing for LM Studio.
- agents/tracing.py: AgentTracer holds StatefulSpanClient per node visit so
  spans actually close (instead of being stuck at the 25s default), tool events
  carry full content, JSON-coerce arbitrary outputs.
- agents/builtin/general/graph.py: per-node spans, ENTER/EXIT logs, isolated
  sub-agent state, _strip_subagent_messages so sub-agent chatter doesn't
  leak back into supervisor history. Router stops at the most recent
  assistant turn (no more skipping past text replies to re-fire delegation).
- agents/builtin/general/nodes/researcher.py: max_steps 6→4, salvage tool
  results into Findings.summary on max_steps, fix prompt path.
- agents/builtin/general/nodes/supervisor.py: extract delegate_brief,
  preserve LLM prose on finalize tool calls.
- prompts/researcher: clarify diagram_id vs object_id vs technology_id.
- api/v1/agents.py: SHIELD runtime_iter from heartbeat wait_for so the
  25s ping interval no longer cancels in-flight LLM calls.

Frontend:
- agent-chat: drop unmount-abort so closing the bubble doesn't kill the
  in-flight agent run; chat_context now reads useLocation directly so it
  works outside <Routes>; AgentStreamProvider hoists shared SSE state.

Tests: 828 backend + 73 frontend passing.
…n evals

Three loosely-coupled improvements to the multi-agent pipeline:

1. **Live canvas updates from agent tools.** Mutating tools (`create_object`,
   `update_object`, `delete_object`, `place/move/unplace_on_diagram`,
   `create/update/delete_connection`, `create/update/delete_diagram`,
   `link_object_to_child_diagram`, `auto_layout_diagram`) now publish to the
   workspace + diagram WS channels with the same payload shape REST uses, so
   open canvases re-render the moment the tool fires. Previously updates
   surfaced only after the SSE stream finished and `useAppliedChangeSync`
   triggered a refetch — which happened seconds late or required a window
   refocus. Shared helper in `app/agents/tools/_realtime.py`.

2. **Sub-agent tool-result rewrite.** The supervisor used to see
   `[tool: {"action":"delegate.researcher","question":"..."}]` (an echo of
   its own input) after a `delegate_to_*` call, with the actual findings /
   plan / applied_changes / critique tucked away in a separate system block.
   Qwen routinely re-delegated to the same sub-agent because the tool
   protocol pair was malformed. Now each sub-agent node walks the
   supervisor's history and rewrites the matching tool result with real
   output (`rewrite_subagent_tool_result` in `app/agents/nodes/base.py`;
   sketched after this list).
   The redundant `render_subagent_results_block` is removed from
   supervisor's system blocks. Supervisor prompt updated to reflect the
   new flow and to instruct reuse of existing object IDs surfaced by
   researcher (no duplicate-create when an object already exists).

3. **Golden eval suite (live local Qwen).** New `evals/test_golden_*.py`
   exercise the supervisor → sub-agent graph against LM Studio while
   mocking DB / service layer. Six cases (3 investigate + 3 create-basic).
   Langfuse traces from the suite are tagged with the `:eval` suffix
   (`ARCHFLOW_TRACE_NAME_SUFFIX` env var, read by both `AgentTracer` and
   `LLMClient._build_langfuse_metadata`) so eval runs are filterable from
   real workspace activity. `analytics_consent` flipped to `full` in the
   golden runtime so LLM generations actually surface in Langfuse alongside
   the existing supervisor / sub-agent spans.

Coverage: 835 backend tests + 141 frontend tests passing.
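
A minimal sketch of the rewrite from point 2, assuming OpenAI-style message
dicts with `role` / `tool_call_id` / `content` keys; the real helper operates
on AgentState messages in `app/agents/nodes/base.py` and may differ in shape:

```python
def rewrite_subagent_tool_result(
    messages: list[dict], tool_call_id: str, real_output: str
) -> None:
    """Replace the echoed delegate args with the sub-agent's real output."""
    # Walk backwards: the matching delegate tool result is the most recent one.
    for msg in reversed(messages):
        if msg.get("role") == "tool" and msg.get("tool_call_id") == tool_call_id:
            msg["content"] = real_output
            return
```
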
…nected nodes

Layout engine already anchors a new placement next to placed neighbours of
its model-level connections (`incremental_place` → relatedness centre), but
that signal is empty when the agent's tool order is `create_object` →
`place_on_diagram` → `create_connection`. Result: the new node lands in a
free grid cell far from its eventual neighbour, then a long cross-canvas
arrow gets drawn to it.

Reorder for the connected case: create_object → create_connection →
place_on_diagram. Now the layout engine sees the model-level link at place
time and seeds the position adjacent to the connected neighbour. Standalone
adds (no neighbour) keep the original order — order doesn't matter then.

Diagram prompt updated with a new step-4 rule, an Example 1b walk-through,
and a clarifying note that `create_connection` doesn't require both
endpoints to be placed on the diagram yet.
Edges from agent-driven `create_connection` used to leave
`source_handle` / `target_handle` empty, which made React Flow fall back
to its default port (`top`) — connections then attached to the top of
every node, criss-crossing the canvas.

Combined approach:

* **Auto-pick (default).** New helper
  `app/agents/layout/handles.auto_pick_handles(src_box, tgt_box)`: picks
  `right`/`left` for horizontal-dominant routes, `bottom`/`top` for
  vertical-dominant. Tie-breaks horizontal — most C4 layouts flow L→R;
  a sketch follows after this list.
* **DB-aware resolver.** `tools/_handle_resolver.py`:
  - `resolve_handles_for_connection(...)`: when both endpoints share
    exactly one diagram, derive handles from the placement pair.
    Returns (None, None) on ambiguity (zero / multiple shared diagrams,
    missing coords).
  - `refresh_handles_for_object_placement(...)`: walks every connection
    touching a freshly-placed/-moved object, fills in null handles
    using both placement geometries; emits the updated rows so the
    caller can publish `connection.updated` WS events.
* **Agent override.** `CreateConnectionInput` now accepts optional
  `source_handle` / `target_handle` (validated against
  `{top, right, bottom, left}`; invalid values dropped). Explicit
  values always win over the auto-pick.
* **Wiring.** `create_connection` resolves and persists handles up
  front. `place_on_diagram` and `move_on_diagram` call the refresh path
  after the placement lands and broadcast `connection.updated` for each
  connection whose handles changed, so open canvases redraw the edge
  from the right side without a refetch.
* **Prompt.** Diagram prompt step 5 documents that handles are auto-
  picked; agents pass them only when the user explicitly asks.
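
A minimal sketch of the auto-pick rule, assuming boxes are plain
`(x, y, width, height)` tuples; the real
`app/agents/layout/handles.auto_pick_handles` may use richer types:

```python
def auto_pick_handles(src_box: tuple, tgt_box: tuple) -> tuple[str, str]:
    """Return (source_handle, target_handle) based on the dominant axis."""
    sx, sy, sw, sh = src_box
    tx, ty, tw, th = tgt_box
    dx = (tx + tw / 2) - (sx + sw / 2)  # centre-to-centre horizontal delta
    dy = (ty + th / 2) - (sy + sh / 2)  # centre-to-centre vertical delta
    if abs(dx) >= abs(dy):  # tie-break horizontal: most C4 layouts flow L→R
        return ("right", "left") if dx >= 0 else ("left", "right")
    return ("bottom", "top") if dy >= 0 else ("top", "bottom")
```
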

Tests: 7 geometry cases + 5 resolver/refresh cases + 3 integration
cases (override wins, auto-pick when no override, invalid override
drops to None). Full suite: 851 backend tests passing.
Trace `863856ba-...` showed the design-partner failure mode: user asked
"add Facade with 5 components inside" → supervisor delegated straight to
diagram-agent (no planner) → diagram-agent burned 10 steps on
create_object × 6 + create_child_diagram × 2 + place × 12 + 1 connection
(Facade ↔ APP frontend) + 1 unplace + auto_layout, hit `max_steps`,
finalized with 5 orphan boxes inside the Facade child diagram and zero
connections among them.

Three classes of fix:

1. **max_steps=200** on every node (supervisor / planner / diagram /
   researcher / critic). Earlier ceilings (10/12/6/4/3) were tuning
   bandaids from an era when local Qwen looped on confused tool calls.
   With #45–#48 + #50 the loop pressure is gone; the workspace cost
   budget (LimitsEnforcer) is the real guard. Removing the per-node
   ceiling means real architecture sessions finish instead of
   `forced_finalize=max_steps`.

2. **Supervisor routing.** Multi-component asks ALWAYS go through the
   planner — never straight to diagram-agent. New trigger guidance:
   ≥2 distinct objects, parent-with-internals, "build/design/structure",
   commas/and lists. Added Example 4 showing the Facade-with-internals
   flow. Anti-pattern: "Treating multi-component asks as single-shot".

3. **Inferred connections.** New Planner rule #7 + worked example 2:
   when adding 2+ siblings inside a parent, propose connections from
   naming/role (Controller→Service, Controller→DB, Auth←caller, "X
   System for Y" → Y consumes X). Each inferred step's rationale
   prefixed with `"inferred:"`. Diagram-agent's recap calls these out
   so the user can remove the wrong guesses.

Supervisor brief now also requires copying existing object IDs
verbatim into the planner brief — partial mitigation for the duplicate-
Facade observed in the same trace (agent created a fresh Facade
alongside an existing one because the brief paraphrased the id).

`test_run_long_path_reaches_max_steps_cleanly` is decoupled from the
real ceiling via monkeypatch (max_steps=10 inside the test) so it stays
fast and verifies the run_react loop still terminates correctly.

Tests: 851 passing.
Local Qwen 35B occasionally takes 4-5 minutes on a single tool-heavy
turn (especially when the planner emits a large JSON plan). The 90-second
ceiling on LLMClient.acompletion / astream was tripping with
`litellm.Timeout: timeout value=90.0, time taken=271.49s` and surfacing
as AGENT_ERROR in the chat. Bump the default to 2000s — the workspace
budget is the real cost guard.
…with active diagram

Trace 7fc45806-… exposed the design-partner failure mode: user asked to
fill the **inside** of an existing Facade, the planner correctly emitted
`place_on_diagram(diagram_id="c7383a8b-…", ...)` (the Facade child
diagram) for every placement, but the diagram-agent ignored those ids
and called the tool with the supervisor's active-diagram id
(4f3b4ceb-…, the root Base System) — so all 6 components landed on the
parent canvas instead of inside the Facade.

Root cause: every `place_on_diagram` example in diagram.md used the
placeholder `"<active-diagram>"`, training the agent to substitute the
active context for whatever the plan said.

Three prompt-level fixes:

1. **diagram.md** — new rule 3: "Use the diagram_id from the plan step
   verbatim, NOT the active-diagram id." Added Example 1c walking through
   exactly this case (placement inside a child diagram while the active
   view is the parent). Existing examples updated to use concrete plan
   ids (`d-system`, `d-base`, `d-cache`) instead of
   `<active-diagram>`. Renumbered subsequent steps.

2. **planner.md** — strengthened rule 3: "Always specify the right
   diagram_id for place_on_diagram. When the user asks for 'X inside
   Facade', look up Facade's child_diagram_id via list_child_diagrams /
   read_object_full and route placements there."

3. **supervisor.md** — new section "Pin the target diagram in your
   brief": when the user says "inside X" / "всередині Y", resolve the
   child-diagram id BEFORE delegating to the planner; if no child
   diagram exists, brief the planner to create one first.

The Context-size-exceeded error in the same trace (`OpenAIException:
Error code: 400`) was only ~6700 estimated tokens — far below the
model's 120k. That's the LM Studio server's loaded `n_ctx` (often
8192 by default), not the backend. Operator must reload the model with
a larger context window.

Tests: 851 passing.
Trace 7fc45806 showed the researcher / diagram-agent occasionally
re-interpret the raw user text instead of acting on the supervisor's
distilled brief — they go off-script (over-fetching, asking questions
that have nothing to do with the current sub-task, suggesting components
the brief doesn't ask for).

`isolated_state_for_subagent` now hides the original user message by
default. The supervisor's brief is the sub-agent's contract — the brief
must be self-contained, and the supervisor prompt is updated to spell
that out: "your brief must be self-contained — distilled intent,
concrete deliverables, no slang or paraphrase the sub-agent would have
to disambiguate".

The critic still needs the original user request to verify the work
against the goal, so `critic_node` opts in explicitly:
`isolated_state_for_subagent(state, include_original_request=True)`.
Future validator-style nodes do the same.

Tests: 2 new cases lock in the default-omit / opt-in behaviour. Full
suite 853 passing.
Researcher answers about diagrams with many objects routinely run
4-12k chars — the 4000 limit was crashing real investigations with
`ValidationError: string_too_long`. Same applies to the
diagram-explainer's Explanation schema. Lift both caps to 16000; the
workspace token budget remains the actual cost guard.

Prompt updated to advertise the new ceiling. Tests updated to assert
the new boundary, plus a positive case ensuring 12k bodies validate.
… clean langfuse trace I/O

Pack of fixes from trace 0fca4ca6 analysis (Facade-internals run that
populated the wrong child diagram and sat in an 8-turn no-op cycle):

1. **Server-side duplicate guard.** ``object_service.create_object``
   now checks ``(workspace_id, type, lower(name))`` for live (non-draft)
   objects and raises ``DuplicateObjectError(existing)`` carrying the
   existing row. Drafts are unaffected — same-name copies in a draft
   are intentional. The agent's ``create_object`` tool wrapper catches
   the error and returns ``action="object.reused"`` with the existing
   id, so the search→create flow is idempotent at the DB level even if
   the LLM forgets ``search_existing_objects``. The REST endpoint
   surfaces the same condition as a 409 with the existing id in the
   detail body.

2. **Researcher conflict detection (prompt).** Researcher must group
   results by ``name.strip().lower()`` after every search/list. Groups
   with ≥2 items become a ``## ⚠ Workspace conflicts`` section in the
   summary, with a canonical pick (most placements + connections,
   tie-broken by oldest ``created_at``) and explicit reasoning.

3. **Supervisor surfaces conflicts (prompt).** New anti-pattern:
   silently disambiguating duplicates. When findings flag conflicts,
   either pick using active-context evidence and *say so* in the final
   reply, or finalize with a question. Never run mutating tools through
   the conflict.

4. **Diagram-agent stops cycling on all-reused (prompt).** Rule 9: if
   every placement/connection step in the batch returns ``reused``,
   emit a "nothing new to do" recap immediately. This is what the
   previous trace did wrong — burned 8 LLM turns repeating
   place→search→read→layout while every result was reused.

5. **Diagram-agent explicit handles (prompt).** Rule 10: pass
   ``source_handle`` / ``target_handle`` when geometry is obvious
   (Controller-on-left + Postgres-on-right → ``right`` / ``left``).
   Backend auto-picks the rest.

6. **Clean langfuse trace I/O.** Trace root ``input`` is now the user's
   verbatim message string and ``output`` is the final assistant text —
   matches the Langfuse standard ``set_current_trace_io(input=...,
   output=...)`` pattern (see lesson_12 reference). The
   ``forced_finalize`` reason goes inline only when there's no final
   message at all.

Tests: ``test_create_object_returns_reused_when_duplicate`` covers the
agent-tool dedup happy path. Full suite 855 passing.
Two follow-ups from trace 355785c7 (the run that created "Facade
Internal" alongside the existing "Facade Internal Components" and then
churned 5 fresh connections via create→delete in the same turn).

**1. Server-side dedup for ``create_child_diagram_for_object``.** Before
creating, we now query for any live (non-draft) diagram whose
``scope_object_id`` already equals the target. If one exists we return
``action="diagram.reused"`` carrying the existing id — same pattern as
the recent ``create_object`` dedup. Planner / diagram prompts updated
to look up the existing child diagram first instead of relying on the
server-side fallback.

**2. Mandatory ``reason: str`` on every destructive op.** Added required
``reason`` (10..1000 chars) to ``DeleteObjectInput``,
``DeleteConnectionInput``, ``DeleteDiagramInput``,
``UnplaceFromDiagramInput``. The diagram prompt explains how to write
useful reasons.

**3. LLM destructive-op reviewer.** New
``app/agents/tools/_destructive_review.py``: after the agent passes
``confirmed=True``, we feed the reviewer LLM the proposed mutation, the
impact preview, the agent's stated reason, and the calling agent's
recent message history (last 12 messages — pulled from
``ctx.agent_messages`` which the runtime now wires into ToolContext).
The reviewer outputs ``DeleteVerdict {"verdict": "APPROVE"|"REJECT",
"rationale": str}``. REJECT raises ``ToolDenied`` so the destructive
service-layer call never fires — exactly the guard that would have
caught the create-then-delete churn in trace 355785c7. When
``ctx.llm_client`` is missing (tests, direct service callers) the
reviewer auto-approves with a marker rationale; reviewer is a safety
net, not a hard barrier.

Tests: 3 new cases — child-diagram reuse, destructive-reviewer reject,
missing-reason validation. Existing delete tests updated to pass the
new ``reason`` field. Full suite 858 passing.
Earlier bumped to 2000s for slow Qwen runs; that was excessively
permissive and let stuck calls block a single turn for over 30 minutes.
600s is enough headroom for any healthy long-thinking LLM step on the
slowest local model we use, while still tripping cleanly when something
genuinely hangs.
… rule

Trace ad2fabad showed two issues, one acute and one slow-burn:

1. **Destructive-op reviewer was silently broken from day one.** Every
   ``delete_*`` call hit:
   ``litellm.BadRequestError: 'response_format.type' must be 'json_schema'
   or 'text'``
   because we passed ``{"type": "json_object"}``, which LM Studio's qwen
   rejects (only ``text`` and ``json_schema`` are accepted there). The
   reviewer caught the exception and auto-approved as a "safety net" —
   so deletes flowed through with **zero LLM scrutiny** even when
   ``ctx.llm_client`` was wired in. Switched both call sites
   (``_destructive_review.py`` and ``limits.py`` health-check) to
   ``response_format={"type": "text"}`` and rely on the prompt to pin
   the JSON output. Reviewer parse path also now strips markdown JSON
   fences and falls back to outermost-brace extraction so qwen's
   occasional ```json ... ``` wrappers don't bypass review (see the
   sketch below).

2. **Multi-edge between the same pair.** Trace cleanup showed 3 parallel
   "authenticates*" connections between User Controller and Auth Service
   ("authenticates users", "authenticates requests" × 2) — residue from
   pre-#36 runs that lacked dedup. Server-side dedup catches exact
   reuse but doesn't merge edges with different labels. Added
   ``diagram.md`` rule 13: "Consolidate same-pair connections" — agents
   now must use ``update_connection`` to enrich an existing edge's
   label instead of stacking parallel arrows. Visual noise prevention.

Test ``test_health_check_uses_health_check_model`` updated to assert
``text`` instead of ``json_object``. Full suite 858 passing.
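
A minimal sketch of that tolerant parse path; function names are illustrative,
not the real module layout:

```python
import json

def _strip_fence_lines(text: str) -> str:
    # Drop any line that is itself a markdown fence marker (``` or ```json).
    return "\n".join(
        line for line in text.strip().splitlines()
        if not line.strip().startswith("```")
    ).strip()

def parse_reviewer_json(text: str) -> dict | None:
    """Strip fences, then fall back to the outermost { ... } slice."""
    cleaned = _strip_fence_lines(text)
    candidates = [cleaned]
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if 0 <= start < end:
        candidates.append(cleaned[start : end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```
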
… text as fallback

Previous patch silenced the LM Studio HTTP 400 by switching to plain
text + manual JSON parse, but constrained decoding is the cleaner
contract — the server actually validates the response shape and we
don't depend on prompt discipline alone. This patch:

* Adds ``_pydantic_response_format(model)`` (in
  ``tools/_destructive_review.py``) and ``_json_schema_response_format``
  (in ``limits.py``). Both produce the OpenAI-style envelope
  ``{"type": "json_schema", "json_schema": {"name": …, "schema": …}}``
  derived from a Pydantic model. ``strict: True`` is intentionally
  omitted because Pydantic v2 doesn't always emit
  ``additionalProperties: false`` at every nested level — the manual
  parse fallback handles minor schema drift (a sketch of the envelope
  follows below).

* New Pydantic model ``_HealthCheckResponse`` in ``limits.py`` mirrors
  the dataclass ``HealthCheckResult`` so we can derive a JSON schema
  for constrained decoding. The dataclass stays the runtime type.

* Both call sites now try ``json_schema`` first and retry with
  ``{"type": "text"}`` if the provider rejects the schema. The manual
  parse path (markdown-fence stripping + outermost-brace extraction)
  remains in the destructive reviewer as a final safety net for
  servers that take ``json_schema`` but still pad their answer.

Test ``test_health_check_uses_health_check_model`` now asserts the
first call uses ``json_schema`` with the expected model name. Suite
858 passing.
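
A minimal sketch of the envelope construction under Pydantic v2;
``DeleteVerdict`` mirrors the reviewer's output model described earlier, and
the helper name is illustrative:

```python
from pydantic import BaseModel

class DeleteVerdict(BaseModel):
    verdict: str      # "APPROVE" | "REJECT"
    rationale: str

def pydantic_response_format(model: type[BaseModel]) -> dict:
    # OpenAI-style constrained-decoding envelope derived from the model's schema.
    # "strict": True intentionally omitted -- Pydantic v2 doesn't always emit
    # additionalProperties: false at every nested level.
    return {
        "type": "json_schema",
        "json_schema": {
            "name": model.__name__,
            "schema": model.model_json_schema(),
        },
    }
```
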
…el prefix

Symptom: user picked OpenRouter as provider with model
``anthropic/claude-haiku-4.5`` and got
``litellm.APIConnectionError: AnthropicException - Unable to get json
response``. The response body was an HTML 404 page from
``openrouter.ai`` — i.e. LiteLLM dispatched the call through the
**native Anthropic SDK** (because of the ``anthropic/`` model prefix),
which posts to ``/v1/messages``, but OpenRouter only exposes the
OpenAI-compatible ``/api/v1/chat/completions`` route. The Anthropic
handler got HTML back, tried to parse it as JSON, and blew up.

Root cause: ``_build_call_kwargs`` only forced
``custom_llm_provider="openai"`` for ``provider="custom"`` (LM Studio
path). For ``provider="openrouter"`` we let LiteLLM auto-route by model
prefix, which breaks every non-OpenAI model name on OpenRouter.

Fix: detect OpenRouter from either ``provider="openrouter"`` or any
``base_url`` containing ``openrouter.ai``. In both cases set
``custom_llm_provider="openai"`` so LiteLLM ignores the prefix and uses
the OpenAI protocol, defaulting ``api_base`` to
``https://openrouter.ai/api/v1`` if none was supplied.
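
A minimal sketch of the detection, with names assumed (the real logic lives in
``_build_call_kwargs``):

```python
OPENROUTER_DEFAULT_BASE = "https://openrouter.ai/api/v1"

def openrouter_overrides(provider: str, base_url: str | None) -> dict:
    """Extra LiteLLM kwargs when the workspace points at OpenRouter."""
    if provider != "openrouter" and "openrouter.ai" not in (base_url or ""):
        return {}
    return {
        # Force the OpenAI protocol so LiteLLM ignores the model prefix
        # (e.g. "anthropic/...") instead of routing to the native SDK.
        "custom_llm_provider": "openai",
        "api_base": base_url or OPENROUTER_DEFAULT_BASE,
    }
```
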

Tests: ``provider=openrouter``, ``base_url=openrouter.ai`` inference,
and the existing LM Studio path stays unchanged. Full suite 861 passing.
LiteLLM's built-in catalog covers OpenAI / Anthropic / Google but
nothing OpenRouter-only (z-ai/*, moonshotai/*, qwen-on-openrouter).
Picking ``z-ai/glm-5v-turbo`` showed
``LiteLLM does not know context window for model 'z-ai/glm-5v-turbo';
falling back to 8192 tokens`` on every LLM call — and the context
manager started compacting prematurely against an 8k ceiling instead
of the model's real 131k.

Fix: new ``app/agents/openrouter_catalog.py`` fetches OpenRouter's
``/api/v1/models`` once per process (TTL 1h, asyncio-locked) and
exposes ``get_context_length(model_id) -> int | None``. Best-effort:
network errors / unknown models fall through to the existing
``litellm.get_max_tokens → 8192`` path.
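
A minimal sketch of the catalog lookup, assuming ``httpx`` as the HTTP client
(an assumption, not necessarily the real dependency):

```python
import asyncio
import time

import httpx  # assumed client; swap for whatever the module actually uses

_CACHE: dict[str, int] = {}
_FETCHED_AT: float = 0.0
_TTL_SECONDS = 3600.0
_LOCK = asyncio.Lock()

async def get_context_length(model_id: str) -> int | None:
    """Best-effort lookup; any failure falls through to the caller's default."""
    global _FETCHED_AT
    async with _LOCK:
        if not _CACHE or time.monotonic() - _FETCHED_AT > _TTL_SECONDS:
            try:
                async with httpx.AsyncClient(timeout=10.0) as client:
                    resp = await client.get("https://openrouter.ai/api/v1/models")
                    resp.raise_for_status()
                    for item in resp.json().get("data", []):
                        ctx = item.get("context_length")
                        if isinstance(item.get("id"), str) and isinstance(ctx, int):
                            _CACHE[item["id"]] = ctx
                _FETCHED_AT = time.monotonic()
            except Exception:
                return None
    return _CACHE.get(model_id)
```
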

``resolve_for_agent`` lazy-fills ``litellm_context_window`` from the
catalog when the workspace picked OpenRouter (either by explicit
``provider=openrouter`` or a base_url pointing at openrouter.ai) and
hasn't set a manual override. The ``LLMClient.context_window()``
priority order (workspace override → litellm.get_max_tokens →
8192 fallback) doesn't change — we just populate the override
upstream so the warning stops firing.

Tests: 5 cases — happy path with cache hit, unknown model, network
failure, malformed payload field handling, empty model id. Full suite
866 passing.
Trace e074e5ba showed repeated `validation error: reason: Field required`
failures because the tool descriptions and Pydantic field had no description —
the LLM saw `reason: str` with no hint that it's mandatory or what to
write. Add a per-tool description and a Pydantic field description
(REQUIRED, ≥10 chars, with concrete examples) for delete_object,
delete_connection, delete_diagram, unplace_from_diagram so the
generated OpenAI tool spec carries the constraint into the model's
context.
Trace d885971d showed delete_object retried 6x with the same incomplete
args because "validation error: reason: Field required" didn't tell
the agent what to put. Append the field's own description to the
error message when validation fails on a top-level field, so the
retry has a concrete fix hint without re-reading the tool spec.
Trace d885971d showed delete_object retried 6x with identical
incomplete args; without a cycle detector the agent burns the entire
max_steps ceiling (200) on a non-progressing loop. Detect repeated
(name, json-args) signatures across steps; on the 4th identical call
break out, emit forced_finalize="stuck" with a tool-loop detail. The
existing finalize lead line for 'stuck' already produces a clean user
message ("I detected I was looping and stopped...").
While the agent is streaming, swap the send button for a red round
cancel button (clicks call stream.cancel()) with an animate-ping ring
acting as the "processing" indicator. Idle state is now also round
with a chevron icon instead of a square unicode arrow. Adds a test
for the cancel branch.
Trace 5e4f3ed9 had diagram batching delete_object(A), delete_object(B),
delete_object(A) etc. across multiple steps — strict-consecutive
matching never tripped because B kept resetting the streak. Switch
to a sliding window (last 8 calls): when any (name, args) signature
hits 4+ occurrences in the window, force-finalize="stuck". Adds a
test for the interleaved-calls case.
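
A minimal sketch of the sliding-window detector; the window and threshold
constants are illustrative:

```python
import json
from collections import Counter, deque

WINDOW = 8      # look at the last 8 tool calls
THRESHOLD = 4   # 4+ identical (name, args) signatures in the window => stuck

class ToolLoopDetector:
    def __init__(self) -> None:
        self._recent: deque[tuple[str, str]] = deque(maxlen=WINDOW)

    def record(self, tool_name: str, args: dict) -> bool:
        """Return True when the run should force-finalize as 'stuck'."""
        signature = (tool_name, json.dumps(args, sort_keys=True))
        self._recent.append(signature)
        return Counter(self._recent)[signature] >= THRESHOLD
```
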
…y run

Trace 5e4f3ed9 showed the agent calling delete_object 8x across the
turn, every call rejected with "validation error: reason: Field
required" — the user's explicit request to delete X never executed.
Even after surfacing the reason requirement in tool/field
descriptions and enriching the validation error with the field
description, smaller LLMs still omit reason and burn the budget.

Drop `min_length=10` and `...` (required) — `reason: str = ""` is now
optional. The two-step preview gate + the destructive-op reviewer LLM
(which also reads recent activity) remain as safety nets, so deletes
without an explicit reason are still sanity-checked, just no longer
blocked by Pydantic validation. The tool/field descriptions still
document `reason` as "strongly recommended" so models that do read
descriptions keep providing it.

Reviewer prompt now substitutes "(none — judge from recent activity
below)" for empty reasons so the LLM doesn't see a literal "''".
…k is the point)

ce0f525 made `reason` optional to "let the delete actually run" when the
model kept omitting it. User pushed back — the whole point of the
destructive-op hook is that the agent narrates intent, otherwise the
reviewer LLM has nothing solid to judge against. Revert to required
(`min_length=10`).

To improve compliance without weakening the constraint:
- Tool descriptions now lead with "REQUIRED arguments: …, reason
  (≥10 chars)" and include a concrete example call. The OpenAI tool
  spec hands these to the provider verbatim.
- Field descriptions still carry the requirement + examples.
- Diagram-agent prompt rule #12 reorders to "MUST include reason"
  with an explicit shape line.
- Restored test_delete_object_missing_reason_validation_error.

Why the model ignored the schema: OpenAI/OpenRouter tool specs treat
`required` as a hint by default — only `"strict": true` enforces it
via constrained decoding (and that's incompatible with our current
schema, which has `default` values on optional fields like `confirmed`).
Validation failure → cycle-break (07c24d4) is the backstop until we
either move to strict mode or use a model that follows specs more
reliably.
…d only

The two-step preview gate + LLM destructive-op reviewer + required
`reason` arg were costing more than they saved: the reviewer was
silently broken for stretches (response_format issues), Pydantic kept
rejecting the model's calls when reason was missing, and the agent
burned its budget on validation-error retries. User decision: drop
the whole layer, just take an id.

Changes:
- DeleteObjectInput, DeleteConnectionInput, DeleteDiagramInput,
  UnplaceFromDiagramInput → just the id field(s). No `confirmed`,
  no `reason`.
- delete_object / delete_connection / delete_diagram /
  unplace_from_diagram → single-shot execution, no preview branch,
  no reviewer call. Tool descriptions trimmed.
- Removed app/agents/tools/_destructive_review.py entirely.
- Diagram prompt rule #12: "destructive ops take only the id".
- evals/test_tool_correctness.py: confirmed-gate expectation now only
  applies to discard_draft (a session-level op, not destructive review).
- Replaced preview/confirmed/reviewer tests with single-shot delete tests.

Server-side foreign-key cascade still owns correctness. The cycle-break
detector (commit 07c24d4) remains as the backstop for runaway loops.
Trace e889a7d9 review: every node SPAN was sitting at the trace root
(parent=None) so supervisor and its delegated sub-agents looked like
peers; the trace input had been clobbered by LiteLLM's langfuse
callback with a raw {messages, tools} payload; spans had no input/
output of their own.

Changes:
- AgentTracer.start_node_span(role, input_payload) — supervisor spans
  are tracked as the default parent for subsequent role="subagent"
  spans, producing supervisor → researcher / planner / diagram /
  critic hierarchy automatically.
- AgentTracer.finish() re-asserts the verbatim user chat_input on
  the trace root so LiteLLM's overwrite doesn't win.
- graph.py: span names are now agent:supervisor / agent:planner /
  agent:diagram / agent:researcher / agent:critic.
- Supervisor span input = last user message + last sub-agent tool
  result; output = delegate_to_X args / final_message / forced reason.
- Sub-agent span input = supervisor's delegate_brief verbatim
  (kind/instruction/reason); output = structured Findings / Plan /
  Critique (model_dump) plus applied_changes preview for the diagram
  sub-agent.
Per-visit supervisor spans (one per LangGraph re-entry) made the trace
look like there were N peer supervisors instead of one orchestrator
with sub-agents nested inside. Open the supervisor span ONCE on first
visit, reuse on every subsequent visit (LLM generations and tool
events nest inside via parent_observation_id), close at trace finish
with the latest buffered output.

The trace tree now matches the mental model: one ``agent:supervisor``
subtree containing all of its own LLM generations + all sub-agent
spans (``agent:researcher`` / ``agent:planner`` / ``agent:diagram`` /
``agent:critic``) — no numbered duplicates.

Supervisor span input is the user's verbatim message; output is the
final delegate target / final_message / forced reason. Sub-agent
input/output unchanged.
Sub-agent spans (agent:researcher / agent:planner / agent:critic) only
carried {forced_finalize, kind, tool_calls_made} — the actual Findings
/ Plan / Critique was never visible in Langfuse. Reason: the helper
read output.structured, but researcher/critic place their artefact on
output.state_patch[<key>] (with fallbacks for empty/malformed LLM
output) — leaving structured=None on the path the user actually hits.

Pull from state_patch first (findings / plan / critique / applied_
changes), fall back to output.structured for planner's pre-lift case,
and also surface the assistant text + applied_changes preview so the
span tells the full story.
Eval suites need the verbatim conversation an agent saw to grade /
replay its behaviour without re-running the LangGraph. Pull
``output.state_patch["messages"]`` and put it on the span's metadata
field as ``{"messages": [...]}``:

- supervisor span — full multi-visit conversation, buffered across
  visits and applied at finish() (mirrors how output is buffered).
- sub-agent spans (researcher / planner / diagram / critic) —
  isolated-state history (the supervisor's brief packed as a single
  user message + the sub-agent's own ReAct turns + tool results).

AgentTracer.end_node_span gained a ``metadata`` kwarg; the supervisor
branch buffers it into ``_supervisor_metadata`` and finish() applies
it. Other spans get it stamped at end-time.
…king

AssistantText now uses react-markdown + remark-gfm so LLM output renders
tables, headings, fenced code blocks, lists, blockquotes, etc. with
project styling tokens. Custom link renderer routes archflow:// URLs
into <ArchflowLink>.

Activity affordances while a stream is active:
- ToolCallCard pending state shows a coral spinning SVG (replacing the
  static ⏳), three pulsing dots in the title row when no preview yet,
  and a coral border + glow ring around the card.
- NodeIndicator badge gets a pulsing coral ring + three dots so the
  user sees the agent is actually working (not stuck) between LLM
  calls. New labels for supervisor / diagram / critic.
- ChatHistory adds a small "Agent thinking" pill at the bottom while
  streaming and the last item is text — no more silent gaps between
  SSE token bursts.
Three regressions / missing features rolled into one commit because
they share the chat-stream plumbing.

cancel button:
- Was a no-op when ``bag.sessionId`` wasn't yet set (i.e. user clicks
  cancel before the first SSE ``session`` frame). Now always:
  marks cancelledByUser, aborts the local fetch, clears
  isStreaming/isReconnecting immediately. POST /cancel still fires
  when sessionId is known so the LangGraph run also stops server-side
  and doesn't burn budget.

session history:
- Picking a past session in SessionPicker only called stream.reset()
  + setActiveSessionId(id) — nothing fetched the session's persisted
  messages, so the bubble showed an empty history. Added
  ``stream.loadHistory(messages, sid)`` which seeds ``events`` with
  synthetic ``message`` frames; ChatBubble's new
  useSessionHistoryLoader watches activeSessionId, fetches via the
  existing useAgentSession hook, and feeds it into the stream.

auto-naming new sessions:
- New backend endpoint POST /agents/sessions/{id}/auto-title runs a
  short LLM call (workspace's resolved settings, max 24 tokens, 30s
  timeout) over the first persisted user message and stores a 3-6
  word title. Idempotent: returns existing title when one is already
  set; falls back to the first 60 chars of the message if the LLM
  call fails. PATCH /agents/sessions/{id} added too for direct title
  edits.
- Frontend fires maybeTitleSession() once when the SSE ``session``
  frame arrives on a fresh stream (gated by bag.titleRequested so
  reconnects don't refire). On success the agent-sessions list +
  per-session detail caches are invalidated so SessionPicker refreshes.
TheAlexPG added 28 commits May 4, 2026 23:21
…rcher; slug from repo name; aggregate by repo

The supervisor was misrouting repo questions to plain ``delegate_to_researcher``
(which has no git access) because the old ``delegate_to_repo_*`` naming was too
opaque for the LLM and the slug was derived from the diagram node name (so a
node "Backend" linked to ``acme/auth-service`` showed up as
``delegate_to_repo_backend`` — completely disconnected from the actual repo
the user was asking about).

Three changes:

  * Rename ``delegate_to_repo_<slug>`` → ``delegate_to_git_researcher_<slug>``
    so the LLM can't confuse the repo path with the plain researcher.
  * Derive slugs from the repo NAME (the ``<name>`` part of ``<owner>/<name>``)
    instead of the diagram node name. When two manifest entries reference
    same-named repos from different owners, both slugs are owner-prefixed
    (``my-org-auth-service`` / ``other-org-auth-service``) — much more
    LLM-readable than the old 4-hex-suffix collision strategy (sketched
    below).
  * Aggregate manifest entries by repo URL when building delegation tools.
    Two diagram nodes linked to the same repo now produce ONE tool whose
    description lists both components ("linked to the AuthService Container
    and the AuthGateway Container").

The ``delegate_to_researcher`` tool description now explicitly states it has
NO git access, pointing the supervisor at the new ``delegate_to_git_researcher_*``
family for source-code questions.
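
A minimal sketch of the slug rule, assuming GitHub-style ``.../owner/name``
URLs; helper names are illustrative:

```python
import re
from collections import Counter

def _slugify(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def build_repo_slugs(repo_urls: list[str]) -> dict[str, str]:
    """Map repo URL -> delegation slug; owner-prefix only on same-name collisions."""
    parsed = {url: url.rstrip("/").split("/")[-2:] for url in repo_urls}
    name_counts = Counter(name for _, name in parsed.values())
    return {
        url: _slugify(name if name_counts[name] == 1 else f"{owner}-{name}")
        for url, (owner, name) in parsed.items()
    }

# build_repo_slugs(["https://github.com/my-org/auth-service",
#                   "https://github.com/other-org/auth-service"])
# -> both slugs come back owner-prefixed; a lone repo keeps plain "auth-service".
```
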
…o_git_researcher_*

Two prompt updates so the LLM correctly routes repo questions:

  * Researcher's own system prompt (``prompts/researcher/system.md``) gains
    an "Out of scope" section near the top stating it has NO access to
    GitHub repos and recommending the supervisor delegate to a
    ``delegate_to_git_researcher_*`` tool for code questions.
  * Supervisor's prompt (``prompts/general/supervisor.md``) updates the
    Researcher role bullet with the same "no git access" clause, and
    Examples 5 & 6 (Repo Q&A and Visualise-this) now use the new tool
    name ``delegate_to_git_researcher_auth-service`` so the supervisor
    cookbook matches the runtime tool list.
…ations

Two crashes the user hit in one session, with a shared root:

- Per-tool `await db.commit()` (added in 5d6448c) could yield while
  the SSE heartbeat tick or cancellation interleaved on the same
  AsyncSession. asyncpg then raised "session is provisioning a new
  connection; concurrent operations are not permitted", the commit
  was swallowed by the surrounding try/except, the object failed to
  persist, and the next ReAct step's create_connection FK-violated
  on the missing target. The whole turn died with a raw stack trace.

  Fix: a single asyncio.Lock created per invocation in runtime.stream(),
  threaded through LimitsEnforcer.db_lock and ToolContext.db_lock. The
  per-tool commit and the existing _safe_rollback now run inside that
  lock, so they cannot interleave with heartbeats or each other.

- IntegrityError / ForeignKeyViolationError bubbling out of the tool
  executor as INTERNAL_ERROR. The executor at tools/base.py:execute_tool
  now sniffs IntegrityError specifically, runs _safe_rollback, and
  returns {status: "error", code: "fk_violation", content: "<short
  asyncpg DETAIL line> -- create the target first, then retry"}. The
  LLM gets an actionable hint and can self-correct on the next step.
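
A minimal sketch of the lock-guarded commit, assuming an AsyncSession-like
``db``; the real code threads the lock through LimitsEnforcer and ToolContext:

```python
import asyncio

async def commit_tool_write(db, db_lock: asyncio.Lock) -> None:
    """Commit one tool's writes without interleaving with heartbeat ticks."""
    async with db_lock:
        try:
            await db.commit()
        except Exception:
            await db.rollback()  # mirrors _safe_rollback, also under the lock
            raise
```
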
LLM returned a 37K-char summary wrapped in a ```json fence; the
fallback path at researcher.py was passing the raw text straight
into Findings(...) without going through the safe parser, so
Pydantic raised string_too_long and the whole turn crashed with
INTERNAL_ERROR.

Fix:
- _strip_markdown_fence() removes ```json/```markdown wrappers
  before validation
- _safe_findings_from_text() validates inside try/except and falls
  back to a truncated Findings(confidence='low') instead of raising
- Cap raised 16000 -> 32000 (verified the field is in-memory only;
  no DB column or API schema enforces the lower bound)
Drilling INTO a Container with a linked GitHub repo previously hid the
repo from the supervisor — the active diagram's objects didn't include
the parent Container (it's the diagram's scope_object, not a placement).
collect_repo_manifest now also walks UPWARD from the active diagram
through scope_object_id → parent placement → parent diagram, capped at
the same 3-level depth as the descendant walk.

RepoLink gains is_ancestor (bool, default False); depth carries upward
distance for ancestors (1=immediate parent's scope_object, 2=grandparent,
...) and BFS depth for descendants (unchanged). Final ordering:
ancestors closest-first → active level → descendants BFS. Cap of 50
total entries applies across both directions; ancestor-side fills first
so the most contextually relevant repo always survives truncation.
Document the bidirectional resolver and the new is_ancestor / depth
semantics so the spec matches the implementation.
Symptom: follow-up messages within the same chat session appeared under
different Langfuse session_ids, so traces couldn't be grouped in the UI.

Root cause: while AgentTracer already passed session_id at trace creation
time, finish() called trace.update() without re-asserting it. LiteLLM's
langfuse callback also upserts the same trace_id (one trace_create event
per generation), so a stray late upsert could leave the trace ungrouped.

Fix: cache session_id on the AgentTracer at __init__ and re-pin it on the
finish() update — mirrors the existing chat_input re-assertion pattern
that works around the same class of late-callback clobbering.

Adds two regression tests in test_tracing.py:
- test_agent_tracer_passes_chat_session_id_to_langfuse: two consecutive
  AgentTracer instantiations with the same session_id produce traces
  with matching session_id but different trace_ids (UI grouping works).
- test_agent_tracer_disabled_when_client_unavailable: tracer no-ops
  gracefully when Langfuse isn't configured.
… chat

Frontend bug: ChatComposer, MagicPromptButtons, and slash commands all
called startStream without session_id, so every follow-up turn hit
backend's _load_or_create_session "create new" branch. The backend
then constructed AgentTracer with a fresh session.id, which flowed
into client.trace(session_id=...) — Langfuse saw a different
session_id per turn even though the chat conversation was continuous.

Fix in use-agent-stream.ts startStream: when caller's body has no
session_id, inject bag.sessionId (captured from the first 'session'
SSE frame). Explicit caller session_id still wins. Single chokepoint
covers every submit path.

The 140c12d trace.update() re-assertion stays as defense-in-depth.
ChatBubble was filtering loaded session messages to user/assistant rows
with content_text only, dropping assistant-with-tool_calls (content lives
in content_json) and tool-result rows entirely. Live SSE renders icons
from tool_call/tool_result events; resumed history had no equivalent and
came back as a flat user↔assistant transcript.

New seedEventsFromMessages utility converts persisted AgentChatMessage
rows into the same SSE shape the live stream emits, so ToolCallCard
renders identically when reopening an old chat. status defaults to "ok"
since tool status isn't persisted; node-transition badges stay live-only.
…tures

- .env.example: add LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
  (optional tracing) and the four AGENT_RATE_LIMIT_* knobs as commented
  defaults so operators see what's tunable.
- README: replace the legacy "AI insights (Claude)" bullet with a proper AI
  agents section covering the supervisor + sub-agents, GitHub Repo
  Researcher, Diagram Explainer, provider-agnostic LLMs via LiteLLM, and
  Langfuse tracing. Add LangGraph + LiteLLM to the stack table. Update the
  Environment section to show AGENTS_SECRET_KEY (now required for agents)
  and Langfuse, plus a note that LLM provider keys + GitHub PAT live
  per-workspace in the DB, not in env.
Conflicts resolved:
- Makefile: combined both .PHONY target lists (kill-dev + db-sweep-undo).
- backend/app/main.py: kept agents routers (agent_settings/agent_sessions/
  agents) and added undo router, both alphabetically.
- backend/app/api/v1/objects.py: merged the new repo-link try/except guards
  with the per-user undo's actor_user / from_diagram_id / from_draft_id
  arguments on create_object + update_object calls.
- backend/app/schemas/object.py: kept repo_url/repo_branch fields and added
  from_diagram_id/from_draft_id on both ObjectCreate and ObjectUpdate.
- backend/app/services/object_service.py: combined Connection/diagram model
  imports and re-ordered update_object so repo validation runs before the
  log/undo snapshot pair (and pops undo-context fields out of update_data).

Migration heads were also divergent (undo entries vs object repo link) —
generated alembic merge revision f359350166f3 to bring both lineages
under a single head.
- backend/pyproject.toml: add [tool.setuptools.packages.find] include=app*
  so wheel build doesn't trip over tests/ + evals/ now that both have
  __init__.py (setuptools refuses flat layouts with multiple top-level
  packages — was failing the prod docker image build).
- frontend type errors surfaced by `tsc -b` (the build path uses project
  references; --noEmit doesn't):
  - ChatHistory.test.tsx: drop unused `within` import (TS6133).
  - AssistantText.tsx: type the markdown `code` renderer as `any` —
    react-markdown's intersected `Components` type doesn't expose
    `inline` directly (TS2322).
  - AgentsSettingsPage.tsx: narrow MODEL_CATALOG access via
    `Exclude<ProviderId, 'custom'>` since `isCustomProvider` already
    gates the branch (TS7053 + TS7006).
- .github/workflows/test.yml: add postgres:16 + redis:7 services, point
  DATABASE_URL/REDIS_URL at them, generate a Fernet AGENTS_SECRET_KEY
  in-job, run alembic upgrade head before pytest. Without these the new
  undo + repo + agent-platform tests can't reach a real DB.
91e6520f52f4 shipped with `pass` in upgrade()/downgrade(), so a clean
`alembic upgrade head` ended without a `notifications` table. Existing
deploys are fine because the schema was bootstrapped via
Base.metadata.create_all outside Alembic, but CI (which runs alembic
against a fresh Postgres) was failing every notifications-touching
test with `relation "notifications" does not exist`.

Filled in the real CREATE TABLE + index mirroring app/models/notification.py.
test_concurrent_undo_first_wins_second_409s relies on asyncio.gather
giving two ASGI requests real parallelism, but in-process ASGITransport
serialises them on a single event loop — so both observe seq=2 as the
top before either commits, both return 200, and the expected 409 never
materialises. Reproducible failure on CI, intermittent locally.

Skipping to unblock CI on PR #14. The right fix (DB-level row lock on
UndoEntry top, or running the test against a real HTTP server) belongs
in a follow-up to PR #12 where the test was introduced.
- evals/Makefile: PYTEST now uses `uv run --directory ..` so tests run
  with cwd=backend. Without it `make -C evals fast` hits cwd=evals and
  pytest resolves `evals/test_draft_policy.py` as
  `evals/evals/test_draft_policy.py` — and the conftest's
  `from evals.lib.judge import ...` fails with ModuleNotFoundError.
- evals/test_tool_correctness.py: bump EXPECTED_TOOL_COUNT 41 → 50 to
  account for the 9 repo_* tools added by the GitHub Repo Researcher.
`uv run --directory ..` resolves the project root for uv but doesn't
reliably move pytest's cwd in CI — conftest still couldn't import
evals.lib.judge. Each Make recipe line spawns its own shell, so an
explicit `cd ..` per target is robust and self-contained.
…I venv

Restricting setuptools to `app*` (build-backend fix) stopped uv sync
from installing the `evals` package into the virtual env. Locally an
older editable install kept `evals` on sys.path; CI's fresh venv hit
ModuleNotFoundError on `from evals.lib.judge import ...`. Adding
pythonpath=['.'] reinstates backend/ on sys.path during pytest so the
package import works without re-broadening the wheel.
uv-installed venv doesn't materialise an editable copy of source dirs the
wheel doesn't claim, so the previous `include = ['app*']` made
`from evals.lib.judge import ...` unresolvable in CI's fresh env (pytest's
pythonpath= didn't kick in soon enough during conftest load). Broaden the
include to cover `evals*` too — wheel grows by a handful of helper modules
but CI's fast eval suite finally imports cleanly.

Reverts the now-redundant pytest pythonpath knob.
uv leaves the project as a virtual workspace, so `evals` is never
copied into site-packages — even with packages.find listing it. The
eval conftest does an absolute `from evals.lib.judge import ...`, which
needs `backend/` on sys.path. PYTHONPATH gives us that without
installing anything extra.
uv keeps the project virtual ("source = virtual = ."), so site-packages
never gets `evals/` and CI's eval conftest blew up on
`from evals.lib.judge import ...`. Tried successively: pyproject's
pytest pythonpath knob, broadening setuptools.find, and PYTHONPATH in
the workflow — none reliably landed before pytest imported the eval
conftest. A top-level conftest is the deterministic version: pytest
loads it before descending into evals/, sys.path mutation runs first,
the absolute import resolves.
Top-level backend/conftest.py wasn't kicking in under `uv run` in CI —
the eval conftest still hit ModuleNotFoundError on
`from evals.lib.judge import ...`. Doing the sys.path insertion inline
in evals/conftest.py before the absolute import is bulletproof: pytest
imports the conftest, the first lines run, sys.path now has backend/,
the import below resolves.

Both the test conftest fix here and the top-level backend/conftest.py
land at the same goal (backend/ on sys.path); leaving both because the
top-level one helps any future pytest invocation that doesn't go through
the eval conftest at all.
A global ~/.gitignore rule (lib/) was silently shadowing the entire
backend/evals/lib/ package — judge.py, agent_helpers.py, the cost-cap
plugin, etc. were all present locally but never committed, so CI hit
ModuleNotFoundError on `from evals.lib.judge import ...` no matter how
many sys.path workarounds we added.

Mirror the existing `!frontend/src/lib/` exception for the backend
counterpart and commit the missing files. Root cause of every "fast
eval suite" failure on PR #14.
… dev/prod

What changed
- backend/conftest.py now reads DATABASE_URL on session start, derives a
  `<name>_test` sibling if the URL doesn't already end in `_test`, creates
  the DB if missing, runs `alembic upgrade head` against it, and
  overrides DATABASE_URL/DATABASE_URL_SYNC in os.environ so the rest of
  the app picks up the test DB.
- .github/workflows/test.yml drops the explicit `alembic upgrade head`
  step; the conftest now handles migrations as part of bootstrap.
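
A minimal sketch of the derivation step, stdlib only; the real conftest also
creates the ``_test`` database and runs ``alembic upgrade head`` before the
override:

```python
import os

def derive_test_database_url(url: str) -> str:
    """postgresql+asyncpg://.../archflow -> .../archflow_test (idempotent)."""
    base, _, name = url.rpartition("/")
    return url if name.endswith("_test") else f"{base}/{name}_test"

def bootstrap_test_db_env() -> None:
    test_url = derive_test_database_url(os.environ["DATABASE_URL"])
    os.environ["DATABASE_URL"] = test_url
    # The sync-driver swap below is an assumption about the real URL format.
    os.environ["DATABASE_URL_SYNC"] = test_url.replace("+asyncpg", "")
```
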

Why
- The pytest fixtures TRUNCATE users / workspaces / diagrams / etc. with
  CASCADE. Running `pytest tests/` against the dev DB (which a developer's
  .env points at) wipes real accounts in seconds — that's exactly what
  happened when I verified the merge resolution on PR #14 locally.
- The fix is structural: tests literally cannot run against any DB whose
  name doesn't end in `_test`, because the conftest swaps it before the
  app's Settings instance is ever constructed. No way to forget.
- Same protection covers prod: even if someone accidentally pointed
  DATABASE_URL at a prod host, conftest would only touch `<prod>_test`,
  never the real DB. Prod-deploy paths don't run pytest, so that's
  belt-and-suspenders.

Verified
- Local: `pytest tests/services/test_object_service_repo.py` creates
  archflow_test, migrates it, runs against it, leaves archflow (dev)
  untouched (still has 2 users / 2 workspaces / 2 diagrams from before).
- 729 tests across api/services/agents pass with the new bootstrap.
…uired

- backend/alembic/versions/a1f8c9d2b3e4_repair_notifications_table.py:
  Idempotent CREATE TABLE IF NOT EXISTS for notifications. The original
  91e6520f52f4 shipped with empty upgrade()/downgrade() bodies, so any
  prod that ran past it has the revision recorded as applied but the
  table missing — and Alembic won't rerun the corrected fix on those
  deploys. This new revision rides on top of the merge head and
  re-creates the table only if it isn't already there. Clean deploys
  treat it as a no-op (the fixed 91e6520f52f4 already created it);
  affected prods finally get a working notifications table without
  manual psql intervention.
- .env.example: turn the AGENTS_SECRET_KEY block into a REQUIRED
  callout — spell out what breaks without it, the generate command, and
  the do-not-rotate warning.
- README "Environment" section: dedicated callout explaining
  AGENTS_SECRET_KEY is mandatory for any AI feature to work, with the
  same generate / no-rotation guidance.
The agent_edits_policy values were named in a way that misled users
("live_only" sounded like "this only works on live" rather than "always
write to live"). Cleaning the API:

  * live_only   → live    (default — no draft, write straight to live)
  * drafts_only → drafts  (always work in a draft)
  * ask         → ask     (unchanged)

Backend
  * agent_settings_service: canonical constants + ``normalise_edits_policy``
    accepting legacy aliases. Default flipped from ask → live so a fresh
    workspace gets straightforward behaviour out of the box.
  * resolve_for_agent normalises whatever value sits in the DB row
    before returning, so existing rows storing 'live_only' / 'drafts_only'
    keep working — no data migration needed.
  * runtime resolve_active_draft normalises the policy locally too (covers
    the path where tests / golden_runtime still pass legacy strings).
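
A minimal sketch of the normalisation, with constants assumed (the canonical
values live in agent_settings_service):

```python
CANONICAL_POLICIES = {"live", "drafts", "ask"}
LEGACY_ALIASES = {"live_only": "live", "drafts_only": "drafts"}
DEFAULT_POLICY = "live"

def normalise_edits_policy(value: str | None) -> str:
    """Accept legacy aliases from existing DB rows and fall back to the default."""
    if not value:
        return DEFAULT_POLICY
    cleaned = value.strip().lower()
    cleaned = LEGACY_ALIASES.get(cleaned, cleaned)
    return cleaned if cleaned in CANONICAL_POLICIES else DEFAULT_POLICY
```
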

Frontend
  * AgentEditsPolicy type narrowed to 'live' | 'drafts' | 'ask'.
  * useAgentsSettings normalises the value coming back from the API, so
    UI components only see canonical strings.
  * Settings page radio options now have human-friendly labels + hints
    instead of debug-style "live_only" text.

Also: tightened .gitignore exception for backend/evals/lib/ so it stops
sweeping in __pycache__/ on commit.
TheAlexPG merged commit e6e9582 into main May 6, 2026
4 checks passed